We study probability distributions over free algebras
of trees. Probability distributions can be seen as particular
(formal power) tree series [BR82; EK03], i.e. mappings
from trees to a semiring K. A widely studied class
of tree series is the class of rational (or recognizable) 
tree series which can be defined either in an algebraic 
way or by means of multiplicity tree automata. We ar- 
gue that the algebraic representation is very convenient 
to model probability distributions over a free algebra of 
trees. First, as in the string case, the algebraic representation
makes it possible to design learning algorithms for the whole
class of probability distributions defined by rational tree
series. Note that learning algorithms for rational tree se- 
ries correspond to learning algorithms for weighted tree 
automata where both the structure and the weights are 
learned. Second, the algebraic representation can be eas- 
ily extended to deal with unranked trees (like XML trees 
where a symbol may have an unbounded number of chil- 
dren). Both properties are particularly relevant for ap- 
plications: nondeterministic automata are required for 
the inference problem to be relevant (recall that Hidden 
Markov Models are equivalent to nondeterministic string 
automata); nowadays applications for Web Information 
Extraction, Web Services and document processing con- 
sider unranked trees. 

1 Representation Issues 

Trees, either ranked or unranked, arise in many appli- 
cation domains to model data. For instance XML docu- 
ments are unranked trees; in natural language process- 
ing (NLP), syntactic structure can often be considered 
as treelike. From a machine learning perspective, dealing
with tree-structured data often requires designing probability
distributions over sets of trees. This problem has
been addressed mainly in the NLP community with tools
like probabilistic context-free grammars [MS99].



Weighted tree automata and tree series are powerful 
tools to deal with tree structured data. In particular, 
probabilistic tree automata and stochastic series, which
both define probability distributions on trees, make it
possible to generalize the usual techniques from probabilistic
word automata (or hidden Markov models) and series.

Tree Series and Weighted Tree Automata In these first
two paragraphs, we only consider the case of ranked
trees. A tree series is a mapping from the set of trees
into some semiring K. Motivated by defining probability
distributions, we mainly consider the case $K = \mathbb{R}$.
A recognizable tree series [BR82] S is defined by a
finite-dimensional vector space V over K, a mapping $\mu$
which maps every symbol of arity p into a multilinear
mapping from $V^p$ into V ($\mu$ uniquely extends into a
morphism from the set of trees into V), and a linear form
$\lambda$. S(t) is defined to be $\lambda(\mu(t))$. Tree series
can also be defined by weighted tree automata (WTA). A WTA A
is a tree automaton in which every rule is given a weight in
K. For every run r on a tree t (a computation of the automaton
according to its rules over t), a weight A(t, r) is computed
by multiplying the weights of the rules used in the run and
the final weight of the state at the root of the tree. The
weight A(t) is the sum of A(t, r) over all runs r on t.
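
As a minimal sketch of the two definitions above, the following Python code evaluates a tree both algebraically (a bottom-up vector followed by a linear form) and as a sum over runs. Everything specific here is an assumption made up for illustration: the two-state automaton, its weights, the encoding of trees as nested tuples, and the helper names `mu`, `series`, `runs` and `weight_by_runs`. K is taken to be the reals.

```python
from itertools import product

# Trees are nested tuples: (symbol, child_1, ..., child_p).
# Hypothetical two-state WTA over the symbols a (arity 0) and f (arity 2).
STATES = (0, 1)
# RULES[(symbol, child_states, target_state)] = weight; absent rules weigh 0.
RULES = {
    ("a", (), 0): 0.5,
    ("a", (), 1): 0.5,
    ("f", (0, 1), 0): 0.3,
    ("f", (1, 1), 0): 0.2,
    ("f", (0, 0), 1): 0.4,
}
FINAL = (1.0, 0.5)  # final weight of each state (plays the role of the linear form)


def mu(tree):
    """Bottom-up vector: entry q sums the weights of partial runs reaching q."""
    symbol, children = tree[0], tree[1:]
    child_vecs = [mu(c) for c in children]
    vec = [0.0] * len(STATES)
    for (sym, src, dst), w in RULES.items():
        if sym != symbol or len(src) != len(children):
            continue
        term = w
        for v, q in zip(child_vecs, src):
            term *= v[q]
        vec[dst] += term
    return vec


def series(tree):
    """Algebraic value: the linear form applied to mu(tree)."""
    return sum(l * x for l, x in zip(FINAL, mu(tree)))


def runs(tree):
    """Enumerate every run on tree as a (root_state, weight) pair."""
    symbol, children = tree[0], tree[1:]
    result = []
    for combo in product(*(runs(c) for c in children)):
        src = tuple(q for q, _ in combo)
        sub = 1.0
        for _, w in combo:
            sub *= w
        for dst in STATES:
            result.append((dst, RULES.get((symbol, src, dst), 0.0) * sub))
    return result


def weight_by_runs(tree):
    """Automaton value: sum over all runs of run weight times final weight."""
    return sum(FINAL[q] * w for q, w in runs(tree))


t = ("f", ("f", ("a",), ("a",)), ("a",))
print(series(t), weight_by_runs(t))
```

The bottom-up version sums at every node and runs in time linear in the size of the tree for a fixed automaton, whereas the enumeration of runs grows exponentially with the number of nodes; the two agree here because multiplication of reals is commutative and distributes over addition.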

For commutative semirings, recognizable tree series in 
the algebraic sense and in the automata sense coincide 
because there is an equivalence between summation at 
every step and summation over all runs. It can be shown, 
as in the string case, that the set of recognizable tree 
series defined by deterministic WTA is strictly included 
in the set of recognizable tree series. A Myhill-Nerode
Theorem can be defined for WTA over fields [Bor03].

Probability Distributions and Probabilistic Tree
Automata A probability distribution S over trees is
a tree series such that, for every t, S(t) is between 0
and 1, and such that the sum of all S(t) is equal to 1.
Probabilistic tree automata (PTA) are WTA satisfying
normalization conditions over the weights of rules and the
weights of final states. They extend probabilistic automata
for strings, and we recall that nondeterministic probabilistic
string automata are equivalent to hidden Markov models
(HMMs). As in the string case [DEH06], not all probability
distributions defined by WTA can be defined by PTA.
However, we have proved that any distribution defined
by a WTA with non-negative coefficients can also be defined
by a PTA.

While in the string case every probabilistic automaton
defines a probability distribution, this is no longer
true in the tree case. Similarly to probabilistic context-free
grammars [Wet80], probabilistic tree automata may define
inconsistent (or improper) probability distributions:
the total probability of all trees is less than one. We have
defined a sufficient condition for a PTA to define a probability
distribution and a polynomial-time algorithm for
checking this condition.
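
To see how inconsistency can arise, consider a hypothetical one-state automaton with rules a → q (weight 1 − p) and f(q, q) → q (weight p) and final weight 1. The total mass x of all finite trees is then the least non-negative solution of x = (1 − p) + p·x², which can be approximated by iterating from 0. The sketch below is only this classical fixed-point argument on a made-up example; it is not the sufficient condition or the polynomial-time check mentioned above, and the name `total_mass` is an assumption.

```python
def total_mass(p, iterations=2000):
    """Approximate the total weight of all finite trees for the toy automaton.

    Each iteration adds the mass of trees one level higher, so the sequence
    increases towards the least fixed point of x = (1 - p) + p * x**2.
    """
    x = 0.0
    for _ in range(iterations):
        x = (1.0 - p) + p * x * x
    return x


for p in (0.25, 0.75):
    print(p, total_mass(p))
```

For this toy automaton the least fixed point is min(1, (1 − p)/p): the mass is 1 when p is below one half, while for p = 0.75 only one third of the mass lies on finite trees, so the automaton does not define a probability distribution.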

Towards unranked trees Until this point, we have only
considered ranked trees. However, unranked trees
can be expressed by ranked ones using an isomorphism
defined by an algebraic formulation ([CDG+97], chapter 8).
It consists in using the right adjunction operator @
defined by $f(t_1, \ldots, t_{n-1}) @ t_n = f(t_1, \ldots, t_n)$;
any tree can then be written as an expression whose only
operator is @, and thus as a binary tree: e.g., b(a,a,c(a,a))
corresponds to @(@(@(b, a), a), @(@(c, a), a)). WTA for
unranked trees can be defined as WTA for ranked trees
applied to the algebraic formulation. We call such automata
weighted stepwise tree automata (WSTA).
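
The encoding of unranked trees as binary trees over @ can be sketched in a few lines of Python. The representation chosen here is an assumption for illustration: an unranked tree is a pair (label, list of children), a binary tree is either a label or a triple ("@", left, right), and the names `encode`, `decode` and `show` are hypothetical.

```python
def encode(tree):
    """Rewrite an unranked tree (label, [children]) using only the operator @."""
    label, children = tree
    result = label
    for child in children:
        # f(t_1, ..., t_{n-1}) @ t_n = f(t_1, ..., t_n)
        result = ("@", result, encode(child))
    return result


def decode(term):
    """Inverse of encode: rebuild the unranked tree from its binary form."""
    if isinstance(term, str):
        return (term, [])
    _, left, right = term
    label, children = decode(left)
    return (label, children + [decode(right)])


def show(term):
    """Print a binary term in prefix notation."""
    if isinstance(term, str):
        return term
    _, left, right = term
    return f"@({show(left)}, {show(right)})"


a = ("a", [])
tree = ("b", [a, a, ("c", [a, a])])
print(show(encode(tree)))
```

Applying `decode` to the result of `encode` gives back the original tree, which is the sense in which the two representations are interchangeable in this sketch.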

Hedge automata are automata for unranked trees.
Each rule of a hedge automaton [CDG+97] is written
$f(L) \to q$, where L is a regular language of words
with the set of states of the automaton as its alphabet.
For weighted hedge automata (WHA), the weight of the
rule $f(u) \to q$ is the product of a weight given to the
whole rule $f(L) \to q$ and the weight of u according
to a weighted word automaton associated with $f(L) \to q$.
When K is commutative, WSTA and WHA define the same
weight distributions on unranked trees.

Probabilistic hedge automata can be defined by adding
the same kind of summation conditions to WHA,
but it has yet to be shown that they can be expressed
by PTA through the algebraic formulation. We do not yet
know whether defining series on unranked trees directly is
possible, although it can be achieved using the algebraic
formulation.

2 Learning Probability Distributions 

Inference and Training PTA can be considered as
generative models for trees. The two classical inference
problems are: given a PTA A and a tree t, compute p(t),
which is defined to be the sum of p(t, r) over all runs r
on t; and given a tree t, find the most likely (or Viterbi)
labeling (run) $\hat{r}$ for t, i.e. compute
$\hat{r} = \arg\max_r p(r \mid t)$. It should be noted that
the inference problems are relevant only for nondeterministic
PTA. The training problem is: given a sample set S of trees
and a PTA, learn the best real-valued parameter vector
(weights assigned to rules and to states) according to some
criterion, for instance the likelihood of the sample set or
the likelihood of the sample over Viterbi derivations.
Classical algorithms for inference (the message passing
algorithm) and learning (the Baum-Welch algorithm) can be
designed for PTA over ranked trees and unranked trees.
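
The two inference problems can be sketched for ranked trees by a single bottom-up pass, using sums for p(t) and maxima for the Viterbi run. The automaton below is a made-up example with non-negative weights (its normalization is not checked), trees are nested tuples, and the names `inside`, `viterbi`, `p_tree` and `best_run` are assumptions. Since p(t) does not depend on r, maximizing p(r | t) is the same as maximizing p(t, r), which is what the code does.

```python
# RULES[(symbol, child_states, state)] = weight; FINAL[state] = final weight.
RULES = {
    ("a", (), 0): 0.6,
    ("a", (), 1): 0.4,
    ("f", (0, 1), 0): 0.5,
    ("f", (1, 0), 0): 0.3,
    ("f", (1, 1), 1): 0.2,
}
FINAL = {0: 1.0, 1: 0.5}


def inside(tree):
    """Map each state q to the summed weight of all partial runs reaching q."""
    symbol, children = tree[0], tree[1:]
    tables = [inside(c) for c in children]
    out = {}
    for (sym, src, dst), w in RULES.items():
        if sym == symbol and len(src) == len(children):
            for tab, q in zip(tables, src):
                w *= tab.get(q, 0.0)
            out[dst] = out.get(dst, 0.0) + w
    return out


def viterbi(tree):
    """Map each state q to (best weight, best run) among runs reaching q."""
    symbol, children = tree[0], tree[1:]
    tables = [viterbi(c) for c in children]
    out = {}
    for (sym, src, dst), w in RULES.items():
        if sym == symbol and len(src) == len(children):
            subruns = []
            for tab, q in zip(tables, src):
                sub_w, sub_run = tab.get(q, (0.0, None))
                w *= sub_w
                subruns.append(sub_run)
            if w > out.get(dst, (0.0, None))[0]:
                # A run is stored as (state at this node, run of each child).
                out[dst] = (w, (dst, *subruns))
    return out


def p_tree(tree):
    """p(t): sum over all runs, including the final weight at the root."""
    return sum(FINAL.get(q, 0.0) * w for q, w in inside(tree).items())


def best_run(tree):
    """Return (p(t, r), r) for a run r of maximal weight, or (0.0, None)."""
    cands = [(FINAL.get(q, 0.0) * w, run) for q, (w, run) in viterbi(tree).items()]
    return max(cands, key=lambda c: c[0], default=(0.0, None))


t = ("f", ("a",), ("a",))
print(p_tree(t), best_run(t))
```

For a deterministic automaton each tree has at most one run with non-zero weight, so both functions return the same weight, which illustrates why these problems are only interesting in the nondeterministic case.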



Learning Weighted Automata The learning
problem extends the training problem. Indeed, for
the training problem, the structure of the PTA is given
by the set of rules and only the weights have to be found.
In the learning problem, the structure of the target automaton
is unknown. The learning problem is: given a
sample set S of trees drawn according to a target rational
probability distribution, learn a WTA according to some
criterion. If the probability distribution is defined by a
deterministic PTA, a learning algorithm extending the
unweighted case has been defined in [COCR01]. However,
this algorithm works only for deterministic PTA.
We recall that the class of probability distributions defined
by deterministic PTA is strictly included in the class
of probability distributions defined by PTA [Bor03].

Learning Recognizable Tree Series and thus learning
WTA can be achieved thanks to an algorithm proposed
by Denis and Habrard [DH07]. This algorithm,
which benefits from the existence of a canonical linear
representation of series, can be applied to series which
take their values in $\mathbb{R}$ or $\mathbb{Q}$ to learn
stochastic tree languages. It should be noted that the
algebraic view makes it possible to learn probability
distributions defined by nondeterministic WTA. Learning
probability distributions for unranked trees is ongoing work.